# Image Captioning
## Vit GPT2 Image Captioning

Author: mo-thecreator · Downloads: 17 · Likes: 0
Tags: Image-to-Text · Transformers

An image captioning model based on the ViT-GPT2 architecture (a ViT image encoder paired with a GPT-2 text decoder), capable of generating natural language descriptions for input images.

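ViT-GPT2 captioners are typically packaged as standard Transformers vision-encoder-decoder checkpoints, so the high-level `image-to-text` pipeline is enough for a quick test. A minimal sketch; the repo id below is inferred from the entry's author and title and may differ from the actual checkpoint path:

```python
from transformers import pipeline

# Assumed repo id for illustration; check the hub for the exact path.
captioner = pipeline("image-to-text", model="mo-thecreator/vit-gpt2-image-captioning")

result = captioner("photo.jpg")  # local path or URL of an image
print(result)  # e.g. [{"generated_text": "a dog sitting on a wooden bench"}]
```
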
## Idefics3 8B Llama3

Author: HuggingFaceM4 · License: Apache-2.0 · Downloads: 45.86k · Likes: 277
Tags: Image-to-Text · Transformers · English

Idefics3 is an open-source multimodal model that processes arbitrary interleaved sequences of image and text inputs and generates text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

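Since Idefics3 takes interleaved image and text input, prompts are built with the processor's chat template rather than a plain string. A minimal sketch using the generic Vision2Seq API (requires a transformers release with Idefics3 support); the image URL is a placeholder:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# The chat template inserts the image placeholder tokens for us.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
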
## Nebula

Author: SRDdev · License: MIT · Downloads: 17 · Likes: 0
Tags: Image-to-Text · Transformers

An image-to-text model focused on generating captions for images.

## Kosmos 2 Patch14 24 Dup Ms

Author: ishaangupta293 · License: MIT · Downloads: 21 · Likes: 0
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that integrates visual information with language understanding for image-to-text generation and visual grounding tasks.

## Kosmos 2 Patch14 224

Author: microsoft · License: MIT · Downloads: 171.99k · Likes: 162
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that understands and generates text descriptions of images and can link phrases in the generated text to specific image regions.

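Grounding is driven by the `<grounding>` prompt prefix, and the Kosmos-2 processor can split the raw generation into a clean caption plus entity/bounding-box pairs. A minimal sketch following the published Kosmos-2 usage pattern; the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "microsoft/kosmos-2-patch14-224"
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/snowman.jpg", stream=True).raw)

# The <grounding> prefix asks the model to tie phrases to image regions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Separate the caption from the grounded entities (phrase, span, bounding boxes).
caption, entities = processor.post_process_generation(raw_text)
print(caption)
print(entities)
```
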
## Blip2 Test

Author: advaitadasein · License: MIT · Downloads: 18 · Likes: 0
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on OPT-2.7b. It keeps both the image encoder and the large language model frozen and trains only a querying transformer (Q-Former) to bridge them, enabling image-to-text generation.

## Kosmos 2 Patch14 224

Author: ydshieh · Downloads: 62 · Likes: 54
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that grounds language generation in real-world visual input, supporting a variety of vision-language tasks.

## Blip2 Flan T5 Xxl

Author: LanguageMachines · License: MIT · Downloads: 22 · Likes: 1
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model that combines an image encoder with a large language model (here Flan-T5-XXL) for image-to-text tasks.

## Swin Aragpt2 Image Captioning V3

Author: AsmaMassad · Downloads: 18 · Likes: 0
Tags: Image-to-Text · Transformers

An image captioning model that pairs a Swin Transformer image encoder with an AraGPT2 text decoder, generating textual descriptions for input images.

## Blip2 Flan T5 Xl Sharded

Author: ethzanalytics · License: MIT · Downloads: 71 · Likes: 6
Tags: Image-to-Text · Transformers · English

A sharded build of the BLIP-2 Flan-T5-XL model for image-to-text tasks such as image captioning and visual question answering. Splitting the checkpoint into small shards lets it be loaded in low-memory environments.

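Sharding matters at load time: weights arrive shard by shard, so peak host memory stays near the size of one shard rather than the whole checkpoint. A minimal loading sketch; the repo id is inferred from the entry, and the flags shown are the standard Transformers low-memory options:

```python
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Assumed repo id for illustration; check the hub for the exact path.
model_id = "ethzanalytics/blip2-flan-t5-xl-sharded"

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half-precision weights: roughly half the memory of float32
    low_cpu_mem_usage=True,     # stream shards instead of materializing a full copy first
    device_map="auto",          # place layers on GPU/CPU as capacity allows (needs accelerate)
)
```
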
## Blip2 Opt 6.7b

Author: Salesforce · License: MIT · Downloads: 5,871 · Likes: 76
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on OPT-6.7b, pretrained with the image encoder and large language model frozen, supporting image-to-text generation and visual question answering.

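The same BLIP-2 checkpoint covers both tasks: with no text prompt it captions the image, and with a "Question: ... Answer:" prompt it answers questions about it. A minimal sketch following the published BLIP-2 usage pattern; the image URL is a placeholder:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-opt-6.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())

# Visual question answering: use the question/answer prompt format.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```
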
## Blip2 Flan T5 Xl

Author: Salesforce · License: MIT · Downloads: 91.77k · Likes: 68
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on Flan-T5-XL, pre-trained with the image encoder and large language model frozen, supporting image captioning and visual question answering.

## Textcaps Teste2

Author: artificialguybr · License: MIT · Downloads: 26 · Likes: 3
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer-based image-to-text generation model trained on large-scale image-text pairs, capable of image captioning and visual question answering.

## Git Large

Author: microsoft · License: MIT · Downloads: 1,404 · Likes: 15
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, designed for image-to-text generation tasks.

## Git Base

Author: microsoft · License: MIT · Downloads: 365.74k · Likes: 93
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, designed for image-to-text generation tasks.

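Because GIT is a plain causal decoder over image tokens, captioning needs only pixel values as input, and generation runs through the usual causal LM interface. A minimal sketch following the published GIT usage pattern; the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)

# Image-only input: the decoder generates the caption token by token.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
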